An R Package to Read CCCCO MIS Files
Christian Million
Data Analyst
Yosemite Community College District
comis?
An internally developed R package
Read and Format:
MIS Submission Files
MIS Referential Files
Submission Files: Districts submit MIS data to state via these files
- ~ 25 files | 396 elements
Referential Files: Districts retrieve these files from Data-On-Demand
- ~ 27 files | 406 elements
MIS Data is important
We want to easily use it / analyze it
The Challenge: Reading the MIS data into R is difficult and error-prone
Submission Files:
Fixed Width Format
No Column Names
Numbers that should be characters / dates
Missing values (NA)
Trailing white space
Implied decimal points
Referential Files:
Tab Delimited :)
No Column Names
Numbers that should be characters / dates
Missing values (NA)
Trailing white space
Implied decimal points
Different date format than submission file.
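Several of the issues listed above (trailing white space, blank fields that should be NA, implied decimal points, and dates stored as digits) can be handled with a few small helpers. The following is a minimal base R sketch, not the code used inside comis. The sample values, the helper names (clean_chr, implied_decimal, parse_mis_date), the two implied decimal places, and the "%Y%m%d" date format are all assumptions made for illustration; the date format is left as a parameter because the two file types differ.

```r
# Hypothetical raw values, as they might look when every column is read as character.
raw <- data.frame(
  units      = c("0300", "0150", "    "),            # "0300" is assumed to mean 3.00
  start_date = c("20220825", "20220110", "        "), # assumed YYYYMMDD layout
  title      = c("ENGLISH COMP   ", "ALGEBRA I      ", "               "),
  stringsAsFactors = FALSE
)

# Trim padding and turn blank fields into NA.
clean_chr <- function(x) {
  x <- trimws(x)
  x[x == ""] <- NA_character_
  x
}

# Restore an implied decimal point by dividing by 10^digits.
implied_decimal <- function(x, digits = 2) {
  as.numeric(clean_chr(x)) / 10^digits
}

# Parse dates; pass a different format string for files that use another layout.
parse_mis_date <- function(x, format = "%Y%m%d") {
  as.Date(clean_chr(x), format = format)
}

cleaned <- data.frame(
  units      = implied_decimal(raw$units),
  start_date = parse_mis_date(raw$start_date),
  title      = clean_chr(raw$title),
  stringsAsFactors = FALSE
)
print(cleaned)
```

Keeping each conversion in its own function means a change to the number of implied decimals or the date layout is made once instead of in every analysis script.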
Imagine writing code to handle this for each analysis:
A lot to re-remember
Cognitively taxing to implement
Takes time
Updates to multiple scripts
Copy / paste errors
Makes scripts more difficult to read
Unfulfilling
Lots of overhead before analysis can begin
comis
# Load Libraries -----
library(dplyr)
library(readr)
# Define Names, Types, and Widths -----
CB_col_names <- c('GI90', 'GI01', 'GI03', paste0("CB0", 0:9), paste0("CB", 10:27), "Filler")
# readr expects col_types as one compact string, one letter per column
CB_col_types <- strrep("c", length(CB_col_names))
CB_col_width <- c(2, 3, 3, 12, 12, 68, 6, 1, 1, length(109:112), length(113:116), 1, 1, 1, 1, 1, 1, 6, 8, length(137:148), length(149:160), length(161:172), 7, 9, 1, 1, 1, 1, 1, 1, 1, 26)
XB_col_names <- c('GI90', 'GI01', 'GI03', 'GI02', 'CB01', paste0('XB0', 0:9), 'XB10', 'XB11', 'XB12', 'CB00', 'Filler')
XB_col_types <- strrep("c", length(XB_col_names))
XB_col_width <- c(2, 3, 3, 3, 12, 6, 1, 6, 6, 1, length(44:47), length(48:51), 1, 1, 1, 1, length(56:61), 1, 12, 7)
# Read the source data -----
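# Submission files are fixed width, so read them with read_fwf() and the widths above
CB_src <- readr::read_fwf("path/to/U59223CB.DAT",
                          col_positions = readr::fwf_widths(CB_col_width, CB_col_names),
                          col_types = CB_col_types,
                          trim_ws = TRUE)
XB_src <- readr::read_fwf("path/to/U59223XB.DAT",
                          col_positions = readr::fwf_widths(XB_col_width, CB_col_names), # copy / paste errors
                          col_types = XB_col_types,
                          trim_ws = TRUE)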
# Clean and Reformat Data -----
CB <- CB_src |>
  mutate(dates = date_cleaning_code(),
         units = implicit_decimal_code())
XB <- XB_src |>
  mutate(dates = date_cleaning_code(),
         units = implicit_decimal_code())
comis
Contains useful data found on CCCCO websites
Read many files at once
Read from repo
Use DED Name or Descriptive Name
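One way to offer a choice between DED names and descriptive names is a lookup vector applied after the data is read. The sketch below shows the idea only; it is not the comis interface. The function use_descriptive_names, the ded_lookup vector, the descriptive labels, and the sample row are all hypothetical.

```r
# Hypothetical lookup from DED element names to descriptive column names.
# The labels are illustrative and are not taken from comis or the DED.
ded_lookup <- c(
  CB00 = "course_control_number",
  CB01 = "course_dept_number",
  CB02 = "course_title"
)

# Rename any columns found in the lookup; leave all other columns untouched.
use_descriptive_names <- function(df, lookup = ded_lookup) {
  hits <- names(df) %in% names(lookup)
  names(df)[hits] <- unname(lookup[names(df)[hits]])
  df
}

# Made-up sample row with DED-style column names.
courses <- data.frame(
  CB00 = "CCC000123456",
  CB01 = "ENGL 101",
  CB02 = "ENGLISH COMP",
  GI01 = "590",
  stringsAsFactors = FALSE
)

renamed <- use_descriptive_names(courses)
print(names(renamed))
```

Because the mapping lives in one vector, analysts who prefer DED names can skip the renaming step and everyone else gets the same descriptive names.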
comis
Easier to tell what’s happening
Reduces cognitive overhead
Get to analysis faster and with more confidence
Documentation contained within the package
Updates made in one spot (instead of throughout various scripts)
Shifts focus to what’s important - Using the Data
Addresses problems specific to the institution
Reasonable defaults
Abstracts common tasks
Maintainable
Easily share code with others
Business logic is located in one place
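The idea of keeping business logic in one place can be sketched as a single registry of file layouts plus one reader that consults it. This is an illustration under stated assumptions and not how comis is implemented: the file types AA and BB, their column names and widths, and the functions read_fixed_width and read_many are hypothetical, and the sample files are written to temporary paths.

```r
# Hypothetical registry: one entry per file type, holding column names and widths.
# A layout change is made here once instead of in every script.
specs <- list(
  AA = list(names = c("id", "term", "units"), widths = c(3, 3, 4)),
  BB = list(names = c("id", "code"),          widths = c(3, 2))
)

# Read one fixed-width file as character columns, using the registry entry for its type.
read_fixed_width <- function(path, type, registry = specs) {
  spec <- registry[[type]]
  if (is.null(spec)) stop("Unknown file type: ", type)
  ends   <- cumsum(spec$widths)
  starts <- ends - spec$widths + 1
  lines  <- readLines(path)
  cols   <- Map(function(s, e) trimws(substring(lines, s, e)), starts, ends)
  names(cols) <- spec$names
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Read many files at once; `paths` is a named character vector (names are file types).
read_many <- function(paths, registry = specs) {
  Map(function(p, type) read_fixed_width(p, type, registry), paths, names(paths))
}

# Made-up sample files for demonstration.
tmp_a <- tempfile()
writeLines(c("0012230300", "0022230150"), tmp_a)
tmp_b <- tempfile()
writeLines(c("001AB", "002CD"), tmp_b)

files <- read_many(c(AA = tmp_a, BB = tmp_b))
print(files$AA)
```

Each analysis script then needs only the call to read_many, which keeps the scripts short and makes the layouts easy to maintain and share.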
Pier to Pier | 2022-08-25